
Conversation

@nastya236 (Contributor) commented Nov 18, 2025

This PR adds a new operation mx.qqmm. The current structure is probably neither optimal nor final.

General comment

  1. For inference we want to support: qqmm(quantized weights, bf16 activations).
  2. For training (vjp) we unfortunately still need bf16 weights for two reasons:
    • We currently do not have 2D scaling for nvfp4, so we need to transpose and quantize again along a different dimension.
    • For mxfp8, the recommended recipe is to quantize with 1D blocks and keep two views of the weights (normal and transposed).

Therefore, mx.qqmm takes bf16 activations x, quantized weights w_q and their scales, and optionally bf16 weights plus group_size, mode, and bits.

In the current implementation, it is the user’s responsibility to ensure that group_size, bits, and mode match those used to quantize w_q. This is probably not ideal, and we may want to improve this in the future.
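For concreteness, a rough usage sketch of the call as described above (keyword names, the quantization helper, and the chosen group size are illustrative assumptions, not the final API):

import mlx.core as mx

x = mx.random.normal((128, 256)).astype(mx.bfloat16)  # bf16 activations
w = mx.random.normal((512, 256)).astype(mx.bfloat16)  # bf16 weights, reduction dim last

# Placeholder for whatever produces the quantized weights and their scales,
# quantizing along the last (reduction) dimension.
w_q, scales_w = quantize_fp(w, group_size=16, mode="nvfp4")

# Inference: bf16 activations against quantized weights.
y = mx.qqmm(x, w_q, scales_w, group_size=16, mode="nvfp4")

# Training: additionally pass the bf16 weights so the vjp can re-quantize
# along the other dimension.
y = mx.qqmm(x, w_q, scales_w, w, group_size=16, mode="nvfp4")

As noted above, it is the caller's job to pass the same group_size, bits, and mode that were used to produce w_q and scales_w.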

Very important details

  1. scales are repacked on every call for both weights and activations. In the future, we probably want to:

    • Avoid repacking weight scales for inference.
    • Fuse quantization and repacking, and directly pack into swizzled layout in fp_quantize.
  2. Batched qqmm is currently not supported; inputs must be 2D. For now it is implemented this way because:

    • CUBLASLT_BATCH_MODE_STRIDED is not supported for scales.
    • CUBLASLT_BATCH_MODE_POINTER_ARRAY is not supported for arrays with block scaling.

We almost certainly want to add batching in the future, but for simplicity batch_count = 1 for now (a caller-side flattening workaround is sketched after this list).

  3. qqmm is always executed in TN layout (transpose = True).
    There are several reasons for this, but mainly we always quantize along the reduction dimension, which currently ends up being the last dimension. I am happy to change this if you think it is useful to support all layouts for mxfp8, for example. Also, on B200 only the TN layout is supported for nvfp4 and mxfp4.
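
Since batched qqmm is not supported yet (item 2 above), one caller-side workaround is to flatten the leading batch dimensions of the activations into the rows before the call and restore them afterwards. A minimal sketch, reusing the assumed call from the example above:

def qqmm_flat(x, w_q, scales_w, **kwargs):
    # Collapse leading batch dims of the bf16 activations into rows,
    # run the 2D qqmm, then restore the original batch shape.
    *batch, k = x.shape
    y = mx.qqmm(x.reshape(-1, k), w_q, scales_w, **kwargs)
    return y.reshape(*batch, y.shape[-1])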

Notes

  1. There are some changes to cublas_gemm.cpp: I grouped all common cuBLAS-related functions into a separate helper class in cublas_utils.cpp.
  2. mxfp8 qqmm behaves slightly differently from nvfp4: sometimes, for <<1% of the output elements, the result differs from the dequantized reference by exactly 1 ULP in bf16 (see python/tests/test_quantized.py, line 1027). I do not think this is a bug because:
  • For nvfp4 the output matches exactly for every tested shape.
  • The difference is not structured: there is no clear pattern, and the indices of the affected elements change with the seed.
  • The mismatch is always exactly 1 ULP.

Therefore, I attribute this to differences in accumulation on tensor cores or other numerical details we do not control.
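
As a rough illustration of what "exactly 1 ULP in bf16" admits, a tolerance check along these lines (a sketch, not the actual test) bounds the error by one bf16 spacing of the reference value:

def assert_within_one_ulp_bf16(out, ref):
    # For normal bf16 values, one ULP is at most 2**-7 of the magnitude
    # (7 explicit mantissa bits), so this tolerance accepts a 1-ULP
    # mismatch but nothing looser; a zero reference must match exactly.
    out = out.astype(mx.float32)
    ref = ref.astype(mx.float32)
    assert mx.all(mx.abs(out - ref) <= mx.abs(ref) * 2**-7).item()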

What this PR intentionally lacks (because I first want to make sure the rest of the API looks reasonable)

  1. addmm -- basically c is always nullptr
  2. nn.QQLinear
  3. nn.Linear.to_qqlinear - or similar method to cast to nn.QQLinear (naming is questionable)

Examples are in python/tests/test_quantized.py.
Happy to iterate and change anything here!

@nastya236 nastya236 marked this pull request as draft November 18, 2025 18:43
@nastya236 nastya236 changed the title from qqmm to [WIP] qqmm on Nov 18, 2025
@nastya236 nastya236 marked this pull request as ready for review November 29, 2025 20:20
@nastya236 nastya236 changed the title from [WIP] qqmm to qqmm on Nov 29, 2025
bool transpose_;
};

class DualQuantizedMatmul : public UnaryPrimitive {
Member:

A bit of a nit but I think it makes sense to rename this to QuantizedQuantizedMatmul or QQMatmul to better match the name of the op. Dual is also kind of an overloaded term.

Contributor Author:

Yes, I agree. I think QQMatmul is better, because then the primitive name and the op name are aligned.

bool is_equivalent(const Primitive& other) const override;
std::vector<Shape> output_shapes(const std::vector<array>& inputs) override;
auto state() const {
return std::make_tuple(group_size_, bits_, mode_);
Member:

transpose_ should be part of the state here.

@nastya236 (Contributor Author) commented Dec 2, 2025:

Yeah, this is a bit unclear and probably should be changed. transpose is not a member variable; qqmm is always executed in TN layout (transpose = True). I did it this way because, at the moment, quantization always produces a row-major tensor with the last dimension packed, and TN is the only layout supported for mxfp4 and nvfp4 on B200.

Member:

I see it below in the list under private:. Maybe it should be deleted?


ds = mx.grad(gmm)(s, x, wq)

def test_qqmm(self):
Member:

These tests should only be run for now if mx.cuda.is_available().

And in fact I'm not sure what the behavior is on older hardware and CUDA toolkits. Do you know what the minimum requirements there are?
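
A minimal way to gate the test, assuming the suite keeps using unittest like the other quantization tests (the class name here is just for the sketch):

import unittest
import mlx.core as mx

class TestQQMM(unittest.TestCase):
    @unittest.skipIf(not mx.cuda.is_available(), "qqmm requires a CUDA device")
    def test_qqmm(self):
        ...  # existing test body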

std::optional<int> bits_ /* = std::nullopt */,
const std::string& mode /* = "nvfp4" */,
StreamOrDevice s /* = {} */) {
// currently only simetric quantization is supported for qqmm
Member:

Suggested change
// currently only simetric quantization is supported for qqmm
// currently only symmetric quantization is supported for qqmm

Comment on lines +4334 to +4338
if (qmode == QuantizationMode::Affine) {
std::ostringstream msg;
msg << "[qqmm] Affine quantization is not supported for qqmm.";
throw std::invalid_argument(msg.str());
}
Member:

It looks like this was already checked above?

// https://docs.nvidia.com/cutlass/4.2.1/media/docs/cpp/blackwell_functionality.html
// because w_q should always be quantized along the reduction dimension
// and we quantize so that the last dim is packed, we assume that the last dim
// always the reduction dim so the firat argument in cubals column major is
Member:

Suggested change
// always the reduction dim so the firat argument in cubals column major is
// is always the reduction dim so the first argument in cublas column major is

Member:

This comment feels like it belongs in the cublas implementation rather than here.

auto [w_inner_dims, w_outer_dims] =
extract_qqmm_dims("qqmm", x, w_q, scales_w, w, group_size, bits);

// we don't backprope through qunatized w and scales
@awni (Member) commented Dec 2, 2025:

Suggested change
// we don't backprope through qunatized w and scales
// we don't backprop through quantized w and scales

Comment on lines +4367 to +4368
auto dtype = bfloat16;
// out dtype can be only bf16 for now
@awni (Member) commented Dec 2, 2025:

Why this limitation? It looks like the op can output bf16, fp16 or fp32: https://docs.nvidia.com/cuda/cublas/#id103.

The API should infer the output type from x.

Comment on lines +166 to +167
validate_quantized_input(
tag, w_q, scales_w, "weight matrix", "scales_w", group_size, bits);
Member:

I think you can remove the strings here since x is not quantized. The original error message prior to the diff here makes sense.

Comment on lines +1413 to +1416
array x, // input activations
array w_q, // quantized weights
array w_scales,
std::optional<array> w = std::nullopt, // optional bf16 weights for vjp
Member:

I don't really love this API where sometimes it takes a w as input and sometimes not. I wonder if it makes sense to change it to something like:

Suggested change
array x, // input activations
array w_q, // quantized weights
array w_scales,
std::optional<array> w = std::nullopt, // optional bf16 weights for vjp
array x, // input activations
array w, // possibly quantized weights
std::optional<array> scales, // scales for w, if not provided `w` must be unquantized

So then it will quantize on the fly if w is not quantized and otherwise it will just use w as is.

Member:

And in order to take a vjp, w has to be provided unquantized.
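
Under that proposal, the two call patterns might look like this (a sketch of the suggestion, not the implemented signature):

# Inference: w is already quantized, so its scales must be provided.
y = mx.qqmm(x, w_q, scales_w)

# Training / repeated differentiation: pass unquantized bf16 w and no scales;
# the op quantizes on the fly and remains differentiable with respect to w.
y = mx.qqmm(x, w)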

bits_,
qmode,
s); // (K, N_packed), scales
vjps.push_back(qqmm(
Member:

A minor problem here is that this function is only differentiable once. I think changing the API as suggested above might fix that: you always quantize the inputs on the fly when you want gradients.
